Skip to content

refactor: add NodeBuilder to simplify container.Node creation#198

Merged
chatton merged 4 commits into
mainfrom
chatton/simplify-container-node-creation
May 26, 2026
Merged

refactor: add NodeBuilder to simplify container.Node creation#198
chatton merged 4 commits into
mainfrom
chatton/simplify-container-node-creation

Conversation

@chatton
Copy link
Copy Markdown
Collaborator

@chatton chatton commented May 21, 2026

Summary

  • Adds a fluent NodeBuilder API to container package that handles Node + Lifecycle + Volume setup in a single Build() call
  • Converts all 12 internal callers from the 3-step pattern to the builder
  • Old NewNode, SetContainerLifecycle, and CreateAndSetupVolume marked deprecated but preserved for backwards compatibility
  • Builder supports WithVolumeName (for shared volumes) and WithHostNetwork (for host networking)
  • Extracts nodeName helpers in each package to decouple name computation from the embedded Node

Closes #131

Summary by CodeRabbit

  • Refactor

    • Moved many services to a unified builder-based container initialization for consistent setup.
    • Centralized container naming into helpers for stable, uniform names.
    • Marked legacy direct constructors deprecated in favor of the builder approach.
    • Added restart operations across chains, networks and nodes to stop/recreate containers while preserving volumes.
  • Chores

    • Added lint-suppression annotations in tests for deprecated setup APIs.
  • Bug Fixes

    • Improved Docker command execution waiting and safer key cleanup behavior.

Review Change Stack

Replace the 3-step container.Node setup (NewNode + SetContainerLifecycle +
CreateAndSetupVolume) with a fluent NodeBuilder API. Old functions are
preserved but marked deprecated for backwards compatibility.
@chatton chatton requested a review from tty47 as a code owner May 21, 2026 09:01
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

Adds a fluent NodeBuilder API and migrates many container constructors to use it, centralizes container naming helpers, adds Restart methods for nodes/chains/networks, suppresses staticcheck in tests using deprecated APIs, and hardens docker keyring exec and cleanup behavior.

Changes

Container Node Builder Refactoring

Layer / File(s) Summary
NodeBuilder foundation
framework/docker/container/node.go
NodeBuilder fluent API with chainable configuration methods and Build(ctx, containerName) that validates inputs, initializes Node and Lifecycle, optionally enables host network, resolves volume name, and calls volume setup. NewNode and CreateAndSetupVolume annotated as deprecated comments.
Cosmos ChainNode migration
framework/docker/cosmos/chain_builder.go, framework/docker/cosmos/node.go
Inline ChainNode construction using NodeBuilder, add container home-dir defaulting and effective nodeType fallback, wrap built container Node into ChainNode, conditional genesis keyring preload for validators, add Chain.Restart/RestartNode, and extract chainNodeName; NewChainNode marked deprecated.
DataAvailability node & network
framework/docker/dataavailability/network_builder.go, framework/docker/dataavailability/node.go, framework/docker/dataavailability/network.go
NetworkBuilder.newNode and DA NewNode now use NodeBuilder with homeDir fallback; add daNodeName helper and Node.Restart; Network.Restart restarts nodes concurrently.
EVM stack node migrations
framework/docker/evstack/...
Migrate evstack chain builder and variant nodes (evmsingle, reth, spamoor, evstack root) to NodeBuilder; centralize naming helpers (evstackNodeName, evmSingleNodeName, rethNodeName, spamoorNodeName), mark constructors deprecated where applicable, and add Restart methods.
Hyperlane services migration
framework/docker/hyperlane/*
Migrate agent, forward relayer, and deployer to NodeBuilder construction, add agentNodeName, forwardRelayerNodeName, and deployerNodeName helpers, and return constructed nodes without separate lifecycle/volume setup.
IBC relayer & monitoring nodes
framework/docker/ibc/relayer/hermes.go, framework/docker/jaeger/node.go, framework/docker/victoriatraces/node.go
Migrate Hermes, Jaeger, and VictoriaTraces constructors to NodeBuilder, add centralized naming helpers, and update Name() implementations.
Test lint suppressions for deprecated APIs
framework/docker/chain_node_test.go, framework/docker/node_volume_test.go
Add //nolint:staticcheck annotations where tests still invoke deprecated node/volume APIs; no runtime changes.
Docker keyring exec & cleanup hardening
framework/docker/internal/docker_keyring.go
Resolve key records before container cleanup, refactor deleteKeyFromContainer to accept optional record and conditionally remove .address, and make execCommand poll ExecInspect until completion with a timeout and non-zero-exit handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • tty47
  • mojtaba-esk

Poem

🐰 A builder hops with tools in paw,
Nodes assemble without a flaw,
Names now tidy, lifecycles neat,
Tests hush linters, volumes meet,
Execs wait steady — build complete!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a NodeBuilder API to simplify Node creation, which is the primary objective of this refactor.
Linked Issues check ✅ Passed The PR fully implements the requirements from issue #131: creates a fluent NodeBuilder API that encapsulates Node, Lifecycle, and Volume setup into a single Build() call, eliminating the previous three-step manual process.
Out of Scope Changes check ✅ Passed All changes align with the stated objective. Helper functions extract node naming logic, Restart methods enable node/chain restarts, and docker keyring improvements support the builder pattern—all supporting the primary NodeBuilder refactor.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chatton/simplify-container-node-creation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatton chatton marked this pull request as draft May 21, 2026 09:02
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@framework/docker/container/node.go`:
- Around line 119-134: The Build method on NodeBuilder should validate required
inputs before creating the Node: check that containerName and b.homeDir are
non-empty, and if b.hostNetwork is false ensure b.networkID is non-empty; return
a clear error (with context mentioning NodeBuilder.Build and containerName) if
any check fails. Do these checks at the top of NodeBuilder.Build (before calling
NewNode/NewLifecycle/CreateAndSetupVolume) so invalid config is rejected fast
and later calls like NewNode or CreateAndSetupVolume are not invoked with
missing values.

In `@framework/docker/cosmos/chain_builder.go`:
- Around line 493-501: The fallback branch incorrectly re-checks
nodeConfig.nodeType causing the validator path to be unreachable; when
nodeType==0, derive it from the validator flag instead of re-reading
nodeConfig.nodeType. Update the branch in chain_builder.go to: if nodeType == 0
{ if nodeConfig.Validator { nodeType = types.NodeTypeValidator } else { nodeType
= types.NodeTypeConsensusFull } }, referencing nodeConfig, nodeType,
types.NodeTypeValidator and types.NodeTypeConsensusFull to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4b7ac380-e844-4471-beb9-1f5f9491c78d

📥 Commits

Reviewing files that changed from the base of the PR and between 7288006 and bb27f33.

📒 Files selected for processing (18)
  • framework/docker/chain_node_test.go
  • framework/docker/container/node.go
  • framework/docker/cosmos/chain_builder.go
  • framework/docker/cosmos/node.go
  • framework/docker/dataavailability/network_builder.go
  • framework/docker/dataavailability/node.go
  • framework/docker/evstack/chain_builder.go
  • framework/docker/evstack/evmsingle/node.go
  • framework/docker/evstack/node.go
  • framework/docker/evstack/reth/node.go
  • framework/docker/evstack/spamoor/node.go
  • framework/docker/hyperlane/agent.go
  • framework/docker/hyperlane/forward_relayer.go
  • framework/docker/hyperlane/hyperlane.go
  • framework/docker/ibc/relayer/hermes.go
  • framework/docker/jaeger/node.go
  • framework/docker/node_volume_test.go
  • framework/docker/victoriatraces/node.go

Comment thread framework/docker/container/node.go
Comment thread framework/docker/cosmos/chain_builder.go Outdated
chatton added 2 commits May 21, 2026 10:41
- Remove dead branch in cosmos ChainBuilder that checked nodeType after
  confirming it was zero. Use derived nodeType for Validator field.
- Poll ExecInspect until exec completes in docker keyring execCommand
  to prevent race between rm and subsequent container reads.
- Resolve key address before local delete so the .address file can be
  cleaned up from the container.
@chatton chatton marked this pull request as ready for review May 21, 2026 10:18
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
framework/docker/internal/docker_keyring.go (1)

440-445: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t swallow .address cleanup errors.

Line 444 ignores execCommand errors, which can leave stale address files silently and break consistency after successful key deletion. Return (or at least wrap) this error.

Suggested fix
 	if record != nil {
 		addr, err := record.GetAddress()
 		if err == nil {
 			addrFilePath := filepath.Join(d.containerKeyringDir, addr.String()+".address")
-			_ = d.execCommand(context.TODO(), []string{"rm", "-f", addrFilePath})
+			if err := d.execCommand(context.TODO(), []string{"rm", "-f", addrFilePath}); err != nil {
+				return fmt.Errorf("failed to delete address file %s: %w", addrFilePath, err)
+			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@framework/docker/internal/docker_keyring.go` around lines 440 - 445, The
cleanup of the ".address" file swallows errors from d.execCommand when removing
addrFilePath: after obtaining addr via record.GetAddress() and building
addrFilePath (using d.containerKeyringDir and addr.String()+".address"), check
the error returned by d.execCommand(context.TODO(), []string{"rm","-f",
addrFilePath}) and propagate it (or wrap it with context) instead of discarding;
update the surrounding function (where record, GetAddress, addrFilePath and
d.execCommand are used) to return that error so callers can handle cleanup
failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@framework/docker/internal/docker_keyring.go`:
- Around line 316-328: The exec polling loop that calls
d.dockerClient.ExecInspect with exec.ID currently blocks forever; change it to
respect context cancellation or a bounded timeout by either requiring a context
with deadline or wrapping the loop with a select that checks ctx.Done() (and/or
a timeout timer) on each iteration, and return a descriptive error if cancelled
or timed out; update the polling block around ExecInspect, inspect.Running, cmd
and ctx to abort when ctx.Done() triggers and propagate the context error
instead of looping indefinitely.

---

Outside diff comments:
In `@framework/docker/internal/docker_keyring.go`:
- Around line 440-445: The cleanup of the ".address" file swallows errors from
d.execCommand when removing addrFilePath: after obtaining addr via
record.GetAddress() and building addrFilePath (using d.containerKeyringDir and
addr.String()+".address"), check the error returned by
d.execCommand(context.TODO(), []string{"rm","-f", addrFilePath}) and propagate
it (or wrap it with context) instead of discarding; update the surrounding
function (where record, GetAddress, addrFilePath and d.execCommand are used) to
return that error so callers can handle cleanup failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 107b1110-d893-4046-a543-f125e8a7dc19

📥 Commits

Reviewing files that changed from the base of the PR and between bb27f33 and 9996951.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • framework/docker/cosmos/chain_builder.go
  • framework/docker/cosmos/node.go
  • framework/docker/evstack/node.go
  • framework/docker/evstack/spamoor/node.go
  • framework/docker/internal/docker_keyring.go

Comment thread framework/docker/internal/docker_keyring.go
- add Restart methods to all node and chain types
- validate required inputs in NodeBuilder.Build
- add bounded timeout and context cancellation to docker keyring exec polling
- propagate address file cleanup errors instead of swallowing them
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
framework/docker/internal/docker_keyring.go (1)

316-319: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move the fallback timeout above ExecCreate/ExecStart.

The 30s wrapper is installed only after Lines 305-313, so current context.TODO()/Background() callers can still block forever in ExecCreate or ExecStart. Right now the hardening only bounds ExecInspect.

Suggested fix
 func (d *dockerKeyring) execCommand(ctx context.Context, cmd []string) error {
+	if _, hasDeadline := ctx.Deadline(); !hasDeadline {
+		var cancel context.CancelFunc
+		ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
+		defer cancel()
+	}
+
 	exec, err := d.dockerClient.ExecCreate(ctx, d.containerID, client.ExecCreateOptions{
 		Cmd: cmd,
 	})
 	if err != nil {
 		return fmt.Errorf("failed to create exec: %w", err)
@@
-	if _, hasDeadline := ctx.Deadline(); !hasDeadline {
-		var cancel context.CancelFunc
-		ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
-		defer cancel()
-	}
-
 	ticker := time.NewTicker(50 * time.Millisecond)
 	defer ticker.Stop()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@framework/docker/internal/docker_keyring.go` around lines 316 - 319, The
fallback 30s timeout is applied too late — wrap the incoming context with
context.WithTimeout before any Docker exec calls so ExecCreate and ExecStart
cannot block forever; specifically, in docker_keyring.go check ctx.Deadline()
and, if none, call ctx, cancel = context.WithTimeout(ctx, 30*time.Second) (with
defer cancel()) before invoking ExecCreate and ExecStart (the functions/methods
named ExecCreate/ContainerExecCreate, ExecStart/ContainerExecStart and before
the later ExecInspect/ContainerExecInspect), then proceed to use that wrapped
ctx for all subsequent docker client calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@framework/docker/evstack/node.go`:
- Around line 159-163: The Restart path leaks the previous gRPC connection
because Restart calls n.Remove(...) then n.Start(), which ends up reassigning
n.GrpcConn in initGRPCConnection(); before removing/recreating the container you
must close any existing connection. In Node.Restart, check if n.GrpcConn is
non-nil and call its Close() (and set n.GrpcConn = nil) prior to calling
n.Remove or before re-dialing in initGRPCConnection so the old *grpc.ClientConn
is properly closed; update initGRPCConnection/Start if needed to handle a
nil-safe assign.

In `@framework/docker/evstack/spamoor/node.go`:
- Around line 113-117: Restart currently stops then starts the container but
returns before the daemon is reachable, causing races when callers call API() or
Start(); modify Node.Restart to wait for the daemon to become reachable (the
same condition Start() relies on) before returning: after calling
n.StopContainer(ctx) and starting the container, either call the existing
Node.Start(ctx) helper (if it performs the /metrics readiness check) or
implement a small readiness loop that polls the node's /metrics (via n.API() or
the same health-check used by Start()) with a cancellable timeout from ctx and
return an error if readiness isn't reached; keep StopContainer and
StartContainer calls but ensure Restart only returns once /metrics responds.

---

Duplicate comments:
In `@framework/docker/internal/docker_keyring.go`:
- Around line 316-319: The fallback 30s timeout is applied too late — wrap the
incoming context with context.WithTimeout before any Docker exec calls so
ExecCreate and ExecStart cannot block forever; specifically, in
docker_keyring.go check ctx.Deadline() and, if none, call ctx, cancel =
context.WithTimeout(ctx, 30*time.Second) (with defer cancel()) before invoking
ExecCreate and ExecStart (the functions/methods named
ExecCreate/ContainerExecCreate, ExecStart/ContainerExecStart and before the
later ExecInspect/ContainerExecInspect), then proceed to use that wrapped ctx
for all subsequent docker client calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9dc72906-b505-434f-9c0a-dd70a47dfefe

📥 Commits

Reviewing files that changed from the base of the PR and between 9996951 and 2f729a5.

📒 Files selected for processing (11)
  • framework/docker/container/node.go
  • framework/docker/cosmos/chain.go
  • framework/docker/cosmos/node.go
  • framework/docker/dataavailability/network.go
  • framework/docker/dataavailability/node.go
  • framework/docker/evstack/evmsingle/chain.go
  • framework/docker/evstack/evmsingle/node.go
  • framework/docker/evstack/node.go
  • framework/docker/evstack/reth/node.go
  • framework/docker/evstack/spamoor/node.go
  • framework/docker/internal/docker_keyring.go

Comment on lines +159 to +163
func (n *Node) Restart(ctx context.Context) error {
if err := n.Remove(ctx, types.WithPreserveVolumes()); err != nil {
return fmt.Errorf("failed to remove container for restart: %w", err)
}
return n.Start(ctx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 '\bGrpcConn\b' framework/docker/evstack/node.go
rg -n -C2 'GrpcConn\.Close\s*\(' framework/docker/evstack

Repository: celestiaorg/tastora

Length of output: 300


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Confirm Restart() calls Remove+Start()
rg -n "func \\(n \\*Node\\) Restart\\(ctx context\\.Context\\) error" -n framework/docker/evstack/node.go
rg -n -C2 "Restart\\(ctx context\\.Context\\) error" framework/docker/evstack/node.go

# 2) Confirm Start() calls initGRPCConnection(...) on restart path
rg -n "func \\(n \\*Node\\) Start\\(ctx context\\.Context\\) error" framework/docker/evstack/node.go
rg -n "initGRPCConnection\\(" framework/docker/evstack/node.go -n
rg -n -C2 "initGRPCConnection\\(" framework/docker/evstack/node.go -n

# 3) Check whether any code closes n.GrpcConn anywhere (repo-wide)
rg -n "GrpcConn\\.Close\\s*\\(" -S .

# 4) If Close isn’t present, see if GrpcConn is replaced anywhere else without closing
rg -n -C2 "\\bn\\.GrpcConn\\s*=" framework/docker/evstack/node.go

# 5) Inspect initGRPCConnection to see if it overwrites without closing
rg -n "func \\(n \\*Node\\) initGRPCConnection\\(" framework/docker/evstack/node.go
rg -n -C2 "func \\(n \\*Node\\) initGRPCConnection\\(" framework/docker/evstack/node.go

Repository: celestiaorg/tastora

Length of output: 417


🏁 Script executed:

#!/bin/bash
set +euo pipefail

# Show Start()
echo "=== Start() ==="
rg -n "func \\(n \\*Node\\) Start\\(ctx context\\.Context\\) error" -n framework/docker/evstack/node.go
rg -n -C2 "func \\(n \\*Node\\) Start\\(ctx context\\.Context\\) error" framework/docker/evstack/node.go

# Show initGRPCConnection and where it assigns n.GrpcConn
echo "=== initGRPCConnection ==="
rg -n "func \\(n \\*Node\\) initGRPCConnection\\(" framework/docker/evstack/node.go
rg -n -C2 "initGRPCConnection\\(" framework/docker/evstack/node.go

# Show assignment to n.GrpcConn
echo "=== n.GrpcConn assignment ==="
rg -n -C2 "\\bn\\.GrpcConn\\s*=" framework/docker/evstack/node.go

# Confirm no Close() usage (exit may be 1 if none)
echo "=== GrpcConn.Close() usages (repo-wide) ==="
rg -n "GrpcConn\\.Close\\s*\\(" -S . || true

# Show possible other close patterns just in case
echo "=== any Close() on grpc.ClientConn vars ==="
rg -n "\\.Close\\s*\\(\\s*\\)" framework/docker/evstack/node.go framework/docker/evstack -S || true

Repository: celestiaorg/tastora

Length of output: 2196


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="framework/docker/evstack/node.go"

# Show Restart -> Start(ctx) call area
echo "=== Restart() ==="
sed -n '140,200p' "$file" | nl -ba | sed 's/^/ /'

# Show Start() and initGRPCConnection call area
echo "=== Start() (around initGRPCConnection call) ==="
sed -n '180,260p' "$file" | nl -ba | sed 's/^/ /'

# Show initGRPCConnection implementation around grpc dial + assignment
echo "=== initGRPCConnection() ==="
sed -n '220,280p' "$file" | nl -ba | sed 's/^/ /'

# Show any other uses of n.GrpcConn in this file
echo "=== n.GrpcConn usages in evstack/node.go ==="
rg -n "\\bn\\.GrpcConn\\b" "$file" || true

# Check whether start/recreate uses n.externalPorts (port stability) in this file
echo "=== externalPorts usage in evstack/node.go ==="
rg -n "\\bexternalPorts\\b" "$file" || true

Repository: celestiaorg/tastora

Length of output: 142


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="framework/docker/evstack/node.go"

echo "=== Restart() ==="
sed -n '145,175p' "$file" | cat -n

echo "=== Start() (around initGRPCConnection call) ==="
sed -n '190,245p' "$file" | cat -n

echo "=== initGRPCConnection() ==="
sed -n '220,275p' "$file" | cat -n

echo "=== n.GrpcConn usages in evstack/node.go ==="
rg -n "\\bn\\.GrpcConn\\b" "$file" || true

echo "=== externalPorts usage in evstack/node.go ==="
rg -n "\\bexternalPorts\\b" "$file" || true

Repository: celestiaorg/tastora

Length of output: 6134


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for Close on the grpc connection variable names used in this file.
rg -n "n\\.GrpcConn\\s*\\.\\s*Close\\s*\\(" -S . || true
rg -n "\\bgrpcConn\\s*\\.\\s*Close\\s*\\(" -S framework/docker/evstack || true

# Also search for Close() on anything assigned to n.GrpcConn
rg -n "GrpcConn\\b.*Close\\b|Close\\b.*GrpcConn\\b" -S . || true

Repository: celestiaorg/tastora

Length of output: 45


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "\\bGrpcConn\\b" -S .

# Look for node lifecycle methods that might close connections
rg -n "func \\(n \\*Node\\) (Stop|Close|Shutdown|Terminate|Remove|Destroy)\\b" -S framework/docker/evstack/node.go framework/docker/evstack

# Look for Close() calls on grpc client objects broadly (including not named GrpcConn)
rg -n "\\bgrpc\\.ClientConn\\b|grpcConn\\b" -S framework/docker/evstack

# Inspect evstack/node.go around any Stop/Close method if present
rg -n "func \\(n \\*Node\\) .*Close" -S framework/docker/evstack/node.go || true

Repository: celestiaorg/tastora

Length of output: 608


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="framework/docker/evstack/node.go"

echo "=== evstack/node.go line count ==="
wc -l "$file"

echo "=== All methods on *Node in evstack/node.go ==="
rg -n "func \\(n \\*Node\\) " "$file"

echo "=== Search for Close/Shutdown patterns in evstack/node.go ==="
rg -n "\\b(n\\.(GrpcConn|.*GrpcConn|grpc.*|Grpc).*)\\.(Close|Shutdown|Stop)\\b|\\bClose\\s*\\(\\s*\\)|Shutdown\\s*\\(|Stop\\s*\\(" "$file" || true

echo "=== Show tail of file (cleanup paths might be near end) ==="
tail -n 120 "$file" | cat -n

echo "=== Search for ContainerLifecycle stop/remove hooks that might close connections ==="
rg -n "ContainerLifecycle\\.(Stop|Remove|Destroy|Close|Shutdown)" -S "$file" || true

Repository: celestiaorg/tastora

Length of output: 5621


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "func \\(n \\*Node\\) Remove\\(" -S framework/docker/evstack .
rg -n "Remove\\(ctx, types\\.WithPreserveVolumes\\(\\)\\)" -S . || true
rg -n "WithPreserveVolumes\\(\\)" -S framework/docker/evstack framework/docker | head -n 50

Repository: celestiaorg/tastora

Length of output: 1928


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="framework/docker/container/node.go"

sed -n '180,280p' "$file" | cat -n
sed -n '280,380p' "$file" | cat -n

# Also look for GrpcConn close usage in container/node.go
rg -n "GrpcConn|grpcConn|Close\\(" "$file" || true

Repository: celestiaorg/tastora

Length of output: 5625


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Port mapping behavior
rg -n "evstackSentryPorts" -S framework/docker/evstack .

# Search for grpc.ClientConn Close on GrpcConn field (repo-wide)
rg -n "\\.GrpcConn\\.Close\\s*\\(" -S .

# Search for Close() on any grpc ClientConn variable types (broad, still targeted)
rg -n "\\b(GrpcConn|grpcConn|c\\.GrpcConn|cn\\.GrpcConn)\\b" -S framework/docker | head -n 200

Repository: celestiaorg/tastora

Length of output: 377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="framework/docker/evstack/node.go"

# Show evstackSentryPorts definition
sed -n '1,90p' "$file" | cat -n

# Show createEvstackContainer where usingPorts is applied
sed -n '140,220p' "$file" | cat -n

# Show the startContainer host port extraction + gRPC init
sed -n '190,260p' "$file" | cat -n

Repository: celestiaorg/tastora

Length of output: 9680


Close the old gRPC client before recreating the container.

framework/docker/evstack/node.go’s Restart() calls Remove(...) and then Start(), which dials a new *grpc.ClientConn in initGRPCConnection() and overwrites n.GrpcConn, but there is no corresponding Close() of the previous connection anywhere on the restart path.

Suggested fix
 func (n *Node) Restart(ctx context.Context) error {
+	if n.GrpcConn != nil {
+		if err := n.GrpcConn.Close(); err != nil {
+			return fmt.Errorf("failed to close grpc connection for restart: %w", err)
+		}
+		n.GrpcConn = nil
+	}
 	if err := n.Remove(ctx, types.WithPreserveVolumes()); err != nil {
 		return fmt.Errorf("failed to remove container for restart: %w", err)
 	}
 	return n.Start(ctx)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (n *Node) Restart(ctx context.Context) error {
if err := n.Remove(ctx, types.WithPreserveVolumes()); err != nil {
return fmt.Errorf("failed to remove container for restart: %w", err)
}
return n.Start(ctx)
func (n *Node) Restart(ctx context.Context) error {
if n.GrpcConn != nil {
if err := n.GrpcConn.Close(); err != nil {
return fmt.Errorf("failed to close grpc connection for restart: %w", err)
}
n.GrpcConn = nil
}
if err := n.Remove(ctx, types.WithPreserveVolumes()); err != nil {
return fmt.Errorf("failed to remove container for restart: %w", err)
}
return n.Start(ctx)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@framework/docker/evstack/node.go` around lines 159 - 163, The Restart path
leaks the previous gRPC connection because Restart calls n.Remove(...) then
n.Start(), which ends up reassigning n.GrpcConn in initGRPCConnection(); before
removing/recreating the container you must close any existing connection. In
Node.Restart, check if n.GrpcConn is non-nil and call its Close() (and set
n.GrpcConn = nil) prior to calling n.Remove or before re-dialing in
initGRPCConnection so the old *grpc.ClientConn is properly closed; update
initGRPCConnection/Start if needed to handle a nil-safe assign.

Comment on lines +113 to +117
func (n *Node) Restart(ctx context.Context) error {
if err := n.StopContainer(ctx); err != nil {
return fmt.Errorf("failed to stop container for restart: %w", err)
}
return n.StartContainer(ctx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the daemon to become reachable before returning.

Start() only becomes usable after /metrics responds, but Restart() returns immediately after Docker starts the container. Any caller that hits API() right after restart can race the daemon bootstrap.

Suggested fix
 func (n *Node) Restart(ctx context.Context) error {
 	if err := n.StopContainer(ctx); err != nil {
 		return fmt.Errorf("failed to stop container for restart: %w", err)
 	}
-	return n.StartContainer(ctx)
+	if err := n.StartContainer(ctx); err != nil {
+		return fmt.Errorf("failed to start container for restart: %w", err)
+	}
+	waitHTTP(fmt.Sprintf("http://127.0.0.1:%s/metrics", n.external.HTTP), 20*time.Second)
+	return nil
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (n *Node) Restart(ctx context.Context) error {
if err := n.StopContainer(ctx); err != nil {
return fmt.Errorf("failed to stop container for restart: %w", err)
}
return n.StartContainer(ctx)
func (n *Node) Restart(ctx context.Context) error {
if err := n.StopContainer(ctx); err != nil {
return fmt.Errorf("failed to stop container for restart: %w", err)
}
if err := n.StartContainer(ctx); err != nil {
return fmt.Errorf("failed to start container for restart: %w", err)
}
waitHTTP(fmt.Sprintf("http://127.0.0.1:%s/metrics", n.external.HTTP), 20*time.Second)
return nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@framework/docker/evstack/spamoor/node.go` around lines 113 - 117, Restart
currently stops then starts the container but returns before the daemon is
reachable, causing races when callers call API() or Start(); modify Node.Restart
to wait for the daemon to become reachable (the same condition Start() relies
on) before returning: after calling n.StopContainer(ctx) and starting the
container, either call the existing Node.Start(ctx) helper (if it performs the
/metrics readiness check) or implement a small readiness loop that polls the
node's /metrics (via n.API() or the same health-check used by Start()) with a
cancellable timeout from ctx and return an error if readiness isn't reached;
keep StopContainer and StartContainer calls but ensure Restart only returns once
/metrics responds.

@chatton chatton merged commit b955de0 into main May 26, 2026
6 checks passed
@chatton chatton deleted the chatton/simplify-container-node-creation branch May 26, 2026 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make container.Node creation less clunky

2 participants